Purpose

During the 2020-2021 season, many hockey leagues (including NHL feeder leagues) had shortened seasons or no season at all because of restrictions on play caused by the COVID-19 pandemic. Some affected players played in other leagues or tournaments during the 2020-2021 season. Others did not play any league or tournament games that season. This raises the question of whether not playing games during the 2020-2021 COVID season hurt player development (or caused players to get worse). To answer this question, we examine data from the Ontario Hockey League (OHL), which did not play any games during the 2020-2021 season. Some players from this league chose to play in other leagues or tournaments, while others did not.

Data

Wrangling:

Filtering:

  • 2019-2020 (“pre-COVID”) and 2021-2022 (“post-COVID”) seasons

  • league == OHL

  • Only players who played in the OHL during both pre- and post-COVID seasons
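The three filters above can be sketched in base R as follows. This is only an illustration on a toy table: the data frame `games` and its columns `player`, `league`, and `season` are assumed names, not the real schema of the data set.

```r
# Toy data; column names (player, league, season) are assumptions.
games <- data.frame(
  player = c("A", "A", "B", "B", "C", "D", "D"),
  league = c("OHL", "OHL", "OHL", "WHL", "OHL", "OHL", "OHL"),
  season = c("2019-2020", "2021-2022", "2019-2020", "2021-2022",
             "2021-2022", "2019-2020", "2020-2021"),
  stringsAsFactors = FALSE
)

# Keep OHL rows from the pre- and post-COVID seasons only.
keep_seasons <- c("2019-2020", "2021-2022")
ohl <- games[games$league == "OHL" & games$season %in% keep_seasons, ]

# Count distinct OHL seasons per player and keep those with both.
seasons_per_player <- tapply(ohl$season, ohl$player,
                             function(s) length(unique(s)))
both <- names(seasons_per_player)[seasons_per_player == 2]
ohl_both <- ohl[ohl$player %in% both, ]

ohl_both
```

In the toy table only player A has OHL rows in both seasons, so only A's two rows are kept.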

Variables added:

  • points per game per season (combined if a player played for multiple teams in a season)

  • games played per season (combined if a player played for multiple teams in a season)

  • treatment (i.e. whether a player played more than 10 games during the COVID season)

  • age (approximately the oldest a player was in a given season)

  • player quality approximated by ppg in pre-COVID season

  • whether a player was drafted (not totally up to date)
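As a rough sketch of how the combined games played, points per game, and treatment variables could be derived, the example below sums a player's stints within a season and then flags players with more than 10 games in the COVID season. The table `stints` and the columns `gp` and `pts` are hypothetical names chosen for illustration.

```r
# Toy per-team rows; all column names here are assumptions.
stints <- data.frame(
  player = c("A", "A", "B", "A", "B"),
  season = c("2019-2020", "2019-2020", "2019-2020",
             "2020-2021", "2020-2021"),
  gp     = c(30, 20, 60, 25, 4),
  pts    = c(15, 15, 30, 20, 1),
  stringsAsFactors = FALSE
)

# Combine stints so each player has one row per season.
totals <- aggregate(cbind(gp, pts) ~ player + season,
                    data = stints, FUN = sum)
totals$ppg <- totals$pts / totals$gp

# Treatment: more than 10 games played during the COVID season.
# Players with no COVID-season rows at all would need to be coded
# FALSE separately; that step is not shown here.
covid <- totals[totals$season == "2020-2021", ]
treated <- setNames(covid$gp > 10, as.character(covid$player))

totals
treated
```

Summing points and games before dividing gives a season-level PPG that weights each team stint by its games played, rather than averaging the per-team rates.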

Alternatives to boxplots

  1. Jitter plot / strip plot

  2. Violin

  3. Beeswarm

  4. Density plot with rugs
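Two of these alternatives can be drawn with base R alone; the sketch below uses simulated PPG values, not the OHL data, and the column names are assumptions. Violin and beeswarm plots are left out because they are assumed to need add-on packages (for example `ggplot2::geom_violin()` or the `beeswarm` package).

```r
set.seed(1)
# Simulated points-per-game values; not real OHL data.
toy <- data.frame(
  ppg = c(rgamma(40, shape = 2, rate = 3),
          rgamma(40, shape = 2, rate = 2.5)),
  treatment = rep(c("Did not play", "Played"), each = 40)
)

# 1. Jitter / strip plot: every player is a point, nudged sideways
#    at random so overlapping values stay visible.
stripchart(ppg ~ treatment, data = toy, method = "jitter",
           jitter = 0.15, vertical = TRUE, pch = 16, ylab = "PPG")

# 4. Density plot with a rug marking each observation.
played <- toy$ppg[toy$treatment == "Played"]
d <- density(played)
plot(d, main = "PPG, played during COVID season", xlab = "PPG")
rug(played)
```

Unlike a boxplot, both views show every observation, which helps when group sizes are small or the distribution is skewed.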

EDA

Did skaters who played during the COVID season perform better than those who didn’t?

But is this difference “real”? In other words, did not playing during the COVID season cause players to get worse at hockey, or can this difference simply be explained by confounding variables?

EDA with variables of interest: PPG, position, age, player quality, treatment, GP, season

Check whether the explanatory variables are correlated with each other and with the response.
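A minimal sketch of such a check, assuming numeric columns named `ppg`, `age`, and `gp` and a two-level `treatment` factor (all simulated here): pairwise correlations for the numeric variables, and group means for the categorical one.

```r
set.seed(2)
n <- 100
# Simulated players; column names are assumptions.
toy <- data.frame(
  age       = sample(17:20, n, replace = TRUE),
  gp        = sample(20:68, n, replace = TRUE),
  treatment = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
toy$ppg <- 0.1 * (toy$age - 16) + 0.005 * toy$gp + rnorm(n, sd = 0.2)

# Pairwise Pearson correlations among response and numeric predictors.
cors <- cor(toy[, c("ppg", "age", "gp")])
round(cors, 2)

# For a categorical predictor, compare group means instead.
group_means <- tapply(toy$ppg, toy$treatment, mean)
group_means
```

Large correlations between predictors would signal that their separate effects are hard to tell apart in a regression.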

Concerns about GP (and GP vs PPG)

Takeaways:

  • Players whose PPG is inflated by a low number of games played don’t seem to be a concern

  • There could be some kind of relationship here -> PPG seems to increase with GP

Takeaways:

  • Likely because players get more skilled as they get older, and we are only including players who played both pre- and post-COVID

  • Something to control for in our model

Position

Takeaways:

  • Forwards score more than defensemen, as expected.

  • But, this becomes problematic with the way we’re measuring player quality…

Problems with player quality and PPG

Takeaways:

  • Forwards are weighted as better players in our model because of our biased metric

  • Drafted vs. not drafted is another way to measure quality, but few players are drafted, and it is an all-or-nothing measure of quality.

Takeaways:

  • Position alone does not account for the difference in PPG between treatment and non-treatment players

Age

What’s the distribution of age?

Does PPG increase as players get older?

Does player quality increase as players get older?

Were older players more likely to play during COVID (since older players are likely higher quality players)?

Takeaways:

  • Not really…

Do older players play more games?

Player quality

Did player quality influence whether someone played during COVID season?

Takeaways:

Drafted

If drafted or not:
## 
## Call:
## lm(formula = ppg ~ got_drafted, data = recent2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7204 -0.3837 -0.1204  0.2796  3.4210 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.57897    0.01235  46.897  < 2e-16 ***
## got_draftedYes  0.14140    0.02837   4.984 6.75e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5096 on 2100 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.01169,    Adjusted R-squared:  0.01122 
## F-statistic: 24.84 on 1 and 2100 DF,  p-value: 6.748e-07

  • Very low R-squared value, significant p-value, but we can’t use these statistics because the model does not meet the conditions for inference.

Draft pick number / round:
## 
## Call:
## lm(formula = ppg ~ overall_pick_num, data = drafted)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8521 -0.3705 -0.0767  0.3327  1.5990 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.9068616  0.0457513  19.822  < 2e-16 ***
## overall_pick_num -0.0019559  0.0004067  -4.809 2.16e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4843 on 396 degrees of freedom
## Multiple R-squared:  0.05518,    Adjusted R-squared:  0.0528 
## F-statistic: 23.13 on 1 and 396 DF,  p-value: 2.157e-06

  • Coefficient for ‘overall_pick_num’ is small, but we can’t use these statistics because the model does not meet the conditions for inference.

  • The jitter plot shows that some players drafted in the seventh round played about as many games as those drafted in earlier rounds.
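The conditions for inference mentioned for both draft models are usually judged from residual plots. The sketch below shows one way to do that on simulated, right-skewed data; only the names `ppg` and `got_drafted` come from the output above, and the numbers are made up.

```r
set.seed(3)
n <- 200
# Simulated right-skewed response; values are not from the OHL data.
toy <- data.frame(
  got_drafted = factor(sample(c("No", "Yes"), n, replace = TRUE,
                              prob = c(0.8, 0.2)))
)
toy$ppg <- rgamma(n, shape = 1.5, rate = 2.5) +
  0.14 * (toy$got_drafted == "Yes")

fit <- lm(ppg ~ got_drafted, data = toy)

op <- par(mfrow = c(1, 2))
plot(fit, which = 1)  # residuals vs fitted: check for constant spread
plot(fit, which = 2)  # normal Q-Q: check for skewed or heavy tails
par(op)

# Residual quartiles, as in the lm summary: a maximum far larger in
# size than the minimum points to right skew.
res_summary <- summary(resid(fit))
res_summary
```

A residual range like the one in the first summary above (minimum about -0.72, maximum about 3.42) is the kind of asymmetry the Q-Q plot would make visible.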

Position

Defensemen have fewer points and lower points per game than forwards. Players from different draft rounds are intermixed.

PPG in 2019-2020

Points per game was slightly higher in the 2019-2020 season than in the 2021-2022 season.

Modeling

Basic multiple regression model, no interaction
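A sketch of what an additive model of this kind could look like, using the variables of interest listed in the EDA section. The data are simulated, and every column name except `ppg` (in particular `pre_ppg` for pre-COVID player quality) is an assumption rather than the name used in the analysis.

```r
set.seed(4)
n <- 200
# Simulated players; every column name except ppg is an assumption.
toy <- data.frame(
  treatment = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  position  = factor(sample(c("D", "F"), n, replace = TRUE)),
  age       = sample(17:20, n, replace = TRUE),
  gp        = sample(20:68, n, replace = TRUE),
  pre_ppg   = rgamma(n, shape = 2, rate = 3)
)
toy$ppg <- 0.1 + 0.8 * toy$pre_ppg + 0.1 * (toy$treatment == "Yes") +
  0.15 * (toy$position == "F") + rnorm(n, sd = 0.2)

# Additive model: one coefficient per predictor, no interaction terms.
fit_main <- lm(ppg ~ treatment + position + age + gp + pre_ppg,
               data = toy)

# Estimated treatment effect, holding the other predictors fixed.
coef(summary(fit_main))["treatmentYes", ]
```

The `treatmentYes` row is the quantity of interest: the estimated difference in PPG for players who played during the COVID season, after adjusting for the other predictors.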

Does putting PPG on a log scale help meet model conditions for inference?

No…
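For reference, a log-scale comparison of this kind might be set up as below. The data are simulated. Because `log(0)` is `-Inf` and some players can have zero points, this sketch uses `log1p()` (that is, log(1 + PPG)); how zeros were actually handled in the analysis is not stated, so that choice is an assumption.

```r
set.seed(5)
n <- 200
# Simulated right-skewed PPG with a few zero-point players.
toy <- data.frame(
  treatment = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
toy$ppg <- rgamma(n, shape = 1.5, rate = 2.5)
toy$ppg[1:5] <- 0

fit_raw <- lm(ppg ~ treatment, data = toy)
fit_log <- lm(log1p(ppg) ~ treatment, data = toy)  # log(1 + ppg)

# Side-by-side normal Q-Q plots of the residuals.
op <- par(mfrow = c(1, 2))
qqnorm(resid(fit_raw), main = "Raw PPG")
qqline(resid(fit_raw))
qqnorm(resid(fit_log), main = "log(1 + PPG)")
qqline(resid(fit_log))
par(op)
```

If the transformed residuals still bend away from the reference line, the transformation has not fixed the normality condition.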

Does allowing for all interactions between variables help us meet model conditions?

No… we need a more complex model.
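As an illustration of what "all interactions" means in R's formula syntax, the sketch below fits a main-effects model and a full-interaction model on simulated data and compares them. The predictors `treatment`, `position`, and `age` are assumed names, and only three are used to keep the example small.

```r
set.seed(6)
n <- 300
# Simulated players; column names are assumptions.
toy <- data.frame(
  treatment = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  position  = factor(sample(c("D", "F"), n, replace = TRUE)),
  age       = sample(17:20, n, replace = TRUE)
)
toy$ppg <- 0.3 + 0.1 * (toy$treatment == "Yes") +
  0.2 * (toy$position == "F") + 0.05 * (toy$age - 17) +
  rnorm(n, sd = 0.3)

fit_main <- lm(ppg ~ treatment + position + age, data = toy)
# `*` expands to all main effects plus all two- and three-way interactions.
fit_int  <- lm(ppg ~ treatment * position * age, data = toy)

# Nested-model F test: do the interaction terms improve the fit?
cmp <- anova(fit_main, fit_int)
cmp
```

Adding interactions changes the mean structure but not the error distribution, so it would not be expected to repair skewed residuals on its own.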

Questions/Concerns